
Make rule removal depend on gap in output #22

Merged

rikhuijzer merged 23 commits into main from rh/regression-part-6 on Jun 27, 2023

Conversation

rikhuijzer (Owner) commented Jun 22, 2023

This PR fixes multiple problems:

  • Increases the precision of the rank calculation to avoid removing the wrong rules.
  • Simplifies the calculation of the _feature_space. The old calculation was wrong in some cases (a test was added).
  • Sorts the rules by gap size before removal.
  • Improves docstrings.
  • Various refactorings.

Works towards #13. Maybe this PR already finishes #13 because the max_rules=10 scores are very close to the StableForestRegressor scores.
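The gap-size ordering mentioned above can be sketched as follows. This is a hypothetical Python illustration (SIRUS itself is Julia); the Rule fields and the definition of the gap as the absolute difference between the then- and otherwise-predictions are assumptions for this sketch, not the package's internals.

```python
# Hypothetical sketch of ordering rules by gap size before removal.
# The Rule type and the gap definition are assumptions, not SIRUS code.
from typing import NamedTuple

class Rule(NamedTuple):
    clause: str
    then: float       # prediction when the clause holds
    otherwise: float  # prediction when it does not

def gap(rule: Rule) -> float:
    # A rule whose two predictions differ a lot carries more information.
    return abs(rule.then - rule.otherwise)

def sort_by_gap(rules: list[Rule]) -> list[Rule]:
    # Largest gap first, so the least informative rules are removed first
    # when the list is truncated to max_rules.
    return sorted(rules, key=gap, reverse=True)

rules = [
    Rule("X[i, 1] < 32000.0", 0.061, 0.408),
    Rule("X[i, 2] >= 8000.0", 0.386, 0.062),
    Rule("X[i, 3] < 64.0", 0.20, 0.22),
]
ordered = sort_by_gap(rules)
print([r.clause for r in ordered])
# → ['X[i, 1] < 32000.0', 'X[i, 2] >= 8000.0', 'X[i, 3] < 64.0']
```

With this ordering, truncating to max_rules keeps the rules whose predictions differ most between the two sides of the split.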

Before

23×7 DataFrame
 Row │ Dataset          Model                   Hyperparameters                    nfolds  AUC     RMS     1.96*SE
     │ String           String                  String                             Int64   String  String  String
─────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────
   3 │ blobs            StableRulesClassifier   (n_trees = 50,)                        10  1.00            0.00
  ...
   7 │ titanic          StableRulesClassifier   (n_trees = 1500,)                      10  0.81            0.04
  ...
  11 │ haberman         StableRulesClassifier   (n_trees = 1500,)                      10  0.67            0.02
  ...
  14 │ make_regression  StableForestRegressor   (n_trees = 1500,)                      10          0.78    0.04
  15 │ make_regression  StableRulesRegressor    (n_trees = 1500, max_rules = 100)      10          0.26    0.08
  16 │ make_regression  StableRulesRegressor    (n_trees = 1500, max_rules = 30)       10          0.35    0.07
  17 │ make_regression  StableRulesRegressor    (n_trees = 1500, max_rules = 10)       10          0.40    0.04
  ...
  20 │ boston           StableForestRegressor   (n_trees = 1500,)                      10          0.67    0.09
  21 │ boston           StableRulesRegressor    (n_trees = 1500, max_rules = 100)      10          0.17    0.08
  22 │ boston           StableRulesRegressor    (n_trees = 1500, max_rules = 30)       10          0.23    0.09
  23 │ boston           StableRulesRegressor    (n_trees = 1500, max_rules = 10)       10          0.30    0.08

After

23×7 DataFrame
 Row │ Dataset          Model                   Hyperparameters                    nfolds  AUC     RMS     1.96*SE
     │ String           String                  String                             Int64   String  String  String
─────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────
  ..
   3 │ blobs            StableRulesClassifier   (n_trees = 50,)                        10  1.00            0.00
  ..
   7 │ titanic          StableRulesClassifier   (n_trees = 1500,)                      10  0.83            0.03
  ..
  11 │ haberman         StableRulesClassifier   (n_trees = 1500,)                      10  0.67            0.07
  ..
  14 │ make_regression  StableForestRegressor   (n_trees = 1500,)                      10          0.80    0.05
  15 │ make_regression  StableRulesRegressor    (n_trees = 1500, max_rules = 100)      10          0.54    0.09
  16 │ make_regression  StableRulesRegressor    (n_trees = 1500, max_rules = 30)       10          0.66    0.11
  17 │ make_regression  StableRulesRegressor    (n_trees = 1500, max_rules = 10)       10          0.70    0.08
  ..
  20 │ boston           StableForestRegressor   (n_trees = 1500,)                      10          0.67    0.09
  21 │ boston           StableRulesRegressor    (n_trees = 1500, max_rules = 100)      10          0.41    0.07
  22 │ boston           StableRulesRegressor    (n_trees = 1500, max_rules = 30)       10          0.57    0.08
  23 │ boston           StableRulesRegressor    (n_trees = 1500, max_rules = 10)       10          0.65    0.09

rikhuijzer commented Jun 22, 2023

I'm starting to have serious doubts about my src/dependent.jl implementation. For example, _unique_left_splits seems like too eager a simplification. Maybe I should rewrite the code after re-reading https://cs.stackexchange.com/questions/152803.

rikhuijzer commented

> I'm starting to have serious doubts about my src/dependent.jl implementation. For example, _unique_left_splits seems like too eager a simplification. Maybe I should rewrite the code after re-reading https://cs.stackexchange.com/questions/152803.

The task for tomorrow is then simple: fully work through an example based on the explanation by D.W. and put it in the Implementation Overview.

rikhuijzer commented

> I'm starting to have serious doubts about my src/dependent.jl implementation. For example, _unique_left_splits seems like too eager a simplification. Maybe I should rewrite the code after re-reading https://cs.stackexchange.com/questions/152803.
>
> The task for tomorrow is then simple: fully work through an example based on the explanation by D.W. and put it in the Implementation Overview.

Okay, so the difficulty seems to be that the simple approach of converting the rules to a binary feature space is not as easy as I thought. Linearly dependent rules are not guaranteed to show up. Maybe there is an algorithm to automatically find linear dependence while throwing away constraints, or otherwise I need to find the bug in my implementation of D.W.'s suggestion.
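The binary-feature-space idea can be made concrete with a minimal sketch. This is a Python illustration (the project itself is Julia), assuming single-split rules are encoded as indicator vectors over the cells of the joint feature space; a rule is dropped when its vector does not increase the rank of the vectors kept so far. The encoding and all names here are illustrative, not the SIRUS implementation.

```python
# Sketch: greedy filtering of linearly dependent rules via rank checks.
# Assumption: each single-split rule is encoded as a 0/1 vector over the
# cells of the joint feature space (here, 2 features x 2 regions = 4 cells).

def rank(rows: list[list[float]], tol: float = 1e-9) -> int:
    # Plain Gaussian elimination with partial pivoting; counts pivots
    # whose magnitude exceeds the tolerance.
    m = [row[:] for row in rows]
    r = 0
    for c in range(len(m[0]) if m else 0):
        pivot = max(range(r, len(m)), key=lambda i: abs(m[i][c]), default=None)
        if pivot is None or abs(m[pivot][c]) <= tol:
            continue  # no usable pivot in this column
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(len(m)):
            if i != r and abs(m[i][c]) > tol:
                f = m[i][c] / m[r][c]
                m[i] = [a - f * b for a, b in zip(m[i], m[r])]
        r += 1
        if r == len(m):
            break
    return r

def filter_dependent(vectors: list[list[float]]) -> list[int]:
    # Keep the index of a vector only if it raises the rank of the set
    # kept so far; otherwise it is a linear combination of earlier rules.
    kept: list[list[float]] = []
    keep_idx: list[int] = []
    for i, v in enumerate(vectors):
        if rank(kept + [v]) > rank(kept):
            kept.append(v)
            keep_idx.append(i)
    return keep_idx

# Cells: (X1 low, X2 low), (X1 low, X2 high), (X1 high, X2 low), (X1 high, X2 high)
r1 = [1.0, 1.0, 0.0, 0.0]  # X1 < 32000: indicator of the two "X1 low" cells
r2 = [0.0, 0.0, 1.0, 1.0]  # X1 ≥ 32000: complement of r1
r4 = [0.0, 1.0, 0.0, 1.0]  # X2 ≥ 8000: cuts across both X1 regions
vectors = [r1, r2, r1, r2, r4]
print(filter_dependent(vectors))  # → [0, 1, 4]
```

Note that r4 is independent of r1 and r2 (no combination a·r1 + b·r2 produces [0, 1, 0, 1]), so it must survive the filter; the duplicated copies of r1 and r2 are correctly dropped.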

rikhuijzer commented Jun 25, 2023

Improved accuracy slightly after ordering the rules by gap size in 67e3ec6.

Before

23×7 DataFrame
 Row │ Dataset          Model                   Hyperparameters                    nfolds  AUC     RMS     1.96*SE
     │ String           String                  String                             Int64   String  String  String
─────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────
   3 │ blobs            StableRulesClassifier   (n_trees = 50,)                        10  1.00            0.00
  ...
   7 │ titanic          StableRulesClassifier   (n_trees = 1500,)                      10  0.81            0.04
  ...
  11 │ haberman         StableRulesClassifier   (n_trees = 1500,)                      10  0.67            0.02
  ...
  15 │ make_regression  StableRulesRegressor    (n_trees = 1500, max_rules = 100)      10          0.26    0.08
  16 │ make_regression  StableRulesRegressor    (n_trees = 1500, max_rules = 30)       10          0.35    0.07
  17 │ make_regression  StableRulesRegressor    (n_trees = 1500, max_rules = 10)       10          0.40    0.04
  ...
  20 │ boston           StableForestRegressor   (n_trees = 1500,)                      10          0.67    0.09
  21 │ boston           StableRulesRegressor    (n_trees = 1500, max_rules = 100)      10          0.17    0.08
  22 │ boston           StableRulesRegressor    (n_trees = 1500, max_rules = 30)       10          0.23    0.09
  23 │ boston           StableRulesRegressor    (n_trees = 1500, max_rules = 10)       10          0.30    0.08

After

23×7 DataFrame
 Row │ Dataset          Model                   Hyperparameters                    nfolds  AUC     RMS     1.96*SE
     │ String           String                  String                             Int64   String  String  String
─────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────
  ...
   3 │ blobs            StableRulesClassifier   (n_trees = 50,)                        10  1.00            0.00
  ...
   7 │ titanic          StableRulesClassifier   (n_trees = 1500,)                      10  0.80            0.04
  ...
  11 │ haberman         StableRulesClassifier   (n_trees = 1500,)                      10  0.68            0.07
  ...
  15 │ make_regression  StableRulesRegressor    (n_trees = 1500, max_rules = 100)      10          0.48    0.07
  16 │ make_regression  StableRulesRegressor    (n_trees = 1500, max_rules = 30)       10          0.52    0.11
  17 │ make_regression  StableRulesRegressor    (n_trees = 1500, max_rules = 10)       10          0.55    0.06
  ...
  21 │ boston           StableRulesRegressor    (n_trees = 1500, max_rules = 100)      10          0.34    0.07
  22 │ boston           StableRulesRegressor    (n_trees = 1500, max_rules = 30)       10          0.35    0.07
  23 │ boston           StableRulesRegressor    (n_trees = 1500, max_rules = 10)       10          0.35    0.10

rikhuijzer commented

It looks like improving the rule post-processing step really does improve regression performance. Probably something is still wrong, which would explain why performance remains so poor. This also explains why I earlier noticed that the rule extraction method didn't affect outcomes much: it now looks like it does for regression, but not for classification.

rikhuijzer commented Jun 26, 2023

After 65907ae:

23×7 DataFrame
 Row │ Dataset          Model                   Hyperparameters                    nfolds  AUC     RMS     1.96*SE
     │ String           String                  String                             Int64   String  String  String
─────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────
  ...
   3 │ blobs            StableRulesClassifier   (n_trees = 50,)                        10  1.00            0.00
  ...
   7 │ titanic          StableRulesClassifier   (n_trees = 1500,)                      10  0.84            0.03
  ...
  11 │ haberman         StableRulesClassifier   (n_trees = 1500,)                      10  0.67            0.06
  ...
  15 │ make_regression  StableRulesRegressor    (n_trees = 1500, max_rules = 100)      10          0.64    0.08
  16 │ make_regression  StableRulesRegressor    (n_trees = 1500, max_rules = 30)       10          0.57    0.08
  17 │ make_regression  StableRulesRegressor    (n_trees = 1500, max_rules = 10)       10          0.66    0.07
  ...
  21 │ boston           StableRulesRegressor    (n_trees = 1500, max_rules = 100)      10          0.43    0.06
  22 │ boston           StableRulesRegressor    (n_trees = 1500, max_rules = 30)       10          0.53    0.08
  23 │ boston           StableRulesRegressor    (n_trees = 1500, max_rules = 10)       10          0.55    0.09

rikhuijzer commented Jun 26, 2023

When the single rules are not simplified, _filter_linearly_dependent might remove too many rules, as can be seen via

@test length(S._process_rules(repeat(allrules, 34), algo, 9)) == 9

which fails with 8 == 9. So there is still something wrong with the filter.

rikhuijzer commented Jun 27, 2023

Localized the following bug in eb74b60. It happens only when the number of repeats is greater than or equal to 34:

julia> r1
SIRUS.Rule(TreePath(" X[i, 1] < 32000.0 "), [0.061], [0.408])

julia> r2
SIRUS.Rule(TreePath(" X[i, 1] ≥ 32000.0 "), [0.408], [0.061])

julia> r4
SIRUS.Rule(TreePath(" X[i, 2] ≥ 8000.0 "), [0.386], [0.062])

julia> dependent = S._linearly_dependent([repeat([r2, r1], 34); r4], A, B)
69-element BitVector:
 0
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 ⋮
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 0
 1
 1

Here, r4 should definitely not be considered linearly dependent. However, for some reason, the zero (0) appears not at the last index but a few indices before it.

EDIT: Fixed by using rank(A; atol=1e-6).
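Why an absolute tolerance fixes this class of bug can be shown with a minimal Python sketch (the project itself is Julia, where the fix is rank(A; atol=1e-6)). Floating-point elimination leaves tiny nonzero residues, and an exact rank computation counts them as pivots, which misclassifies a rule's dependence. The rank helper below is a plain Gaussian-elimination illustration, not the SIRUS code.

```python
# Sketch: numerical rank with and without an absolute tolerance.
# A "zero up to rounding noise" entry flips the computed rank.

def rank(rows: list[list[float]], tol: float) -> int:
    # Gaussian elimination with partial pivoting; a pivot counts only if
    # its magnitude exceeds tol.
    m = [row[:] for row in rows]
    r = 0
    for c in range(len(m[0]) if m else 0):
        pivot = max(range(r, len(m)), key=lambda i: abs(m[i][c]), default=None)
        if pivot is None or abs(m[pivot][c]) <= tol:
            continue  # column has no usable pivot
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(len(m)):
            if i != r and abs(m[i][c]) > tol:
                f = m[i][c] / m[r][c]
                m[i] = [a - f * b for a, b in zip(m[i], m[r])]
        r += 1
        if r == len(m):
            break
    return r

# 1e-12 stands in for a residue left behind by earlier elimination steps.
noisy = [[1e-12, 0.0], [0.0, 1.0]]
print(rank(noisy, tol=0.0))   # → 2: the residue is counted as a pivot
print(rank(noisy, tol=1e-6))  # → 1: the residue is treated as zero
```

With tol=0 the matrix looks full rank, so a genuinely dependent rule appears independent (or vice versa); an absolute tolerance like 1e-6 treats near-zero entries as exact zeros, matching the intent of the dependence check.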

rikhuijzer enabled auto-merge (squash) on June 27, 2023 09:06
rikhuijzer merged commit c608c4e into main on Jun 27, 2023
4 checks passed
rikhuijzer deleted the rh/regression-part-6 branch on June 27, 2023 09:16